# Import the geographic health disparities data set from the GitHub repo
library(readr)  # provides read_csv()
urlfile <- 'https://raw.githubusercontent.com/eitanaka/DATS6101_Project1_Team2/main/dataset/geographic_health_disparities.csv'
geo_health_df <- read_csv(url(urlfile))
This paper explores the relationships between four health conditions (depression, poor mental health, lack of sleep, and lack of physical activity) using data from the Centers for Disease Control and Prevention (CDC) PLACES project. The paper begins by examining national-level correlations between the four health conditions and identifying the strongest correlations. State-level correlations are then explored using a US map that highlights the correlations between mental health and both lack of sleep and lack of physical activity across all US states.
Next, multiple linear regression analyses are conducted to isolate the effect of each lifestyle factor (lack of sleep and lack of physical activity) on each health outcome (depression and poor mental health) while controlling for the effects of the other independent variable. The results suggest that lifestyle factors influence poor mental health more than depression and that, of these lifestyle factors, lack of sleep has a greater influence than lack of physical activity.
Our initial research goal is to investigate the geographic distribution of health conditions in the United States during the year 2020, identify emerging patterns of poor health, and determine which health risk behaviors should be targeted for treatment.
Our dataset is taken from a collaborative project between the CDC and the Robert Wood Johnson Foundation called “PLACES: Local Data for Better Health.” The project aims to provide health data for small areas across the country that might otherwise be neglected: “This allows local health departments and jurisdictions, regardless of population size and rurality, to better understand the burden and geographic distribution of health measures in their areas and assist them in planning public health interventions.” More specifically, we rely on the Census Tract Data 2022 release of the PLACES dataset (which is based on data from the 2020 US Census).
We use the dataset to examine tract-level geographic health disparities in the US in 2020. The dataset offers model-based census tract-level estimates of the prevalence of 29 health outcomes, preventive service usages, chronic disease-related health risk behaviors, and health statuses as part of the 2020 U.S. Census. It covers the entire United States (50 states and the District of Columbia) at the county, place, census tract, and ZIP Code Tabulation Area levels, and it provides consistently defined local-area information at all four geographic levels. The estimates were provided by the Epidemiology and Surveillance Branch of the Division of Population Health at the Centers for Disease Control and Prevention (CDC).
These estimates can be used to identify emerging health problems and to help develop and carry out effective, targeted public health prevention activities. Because the small area model cannot detect effects due to local interventions, users are cautioned against using these estimates for program or policy evaluations. Data sources used to generate these model-based estimates include Behavioral Risk Factor Surveillance System (BRFSS) 2020 or 2019 data, Census Bureau 2010 population data, and American Community Survey 2015–2019 estimates.
Initial Question 1: Is there a significant relationship between the prevalence of depression, poor mental health, lack of sleep, lack of physical activity and geographic location in the US?
Initial Question 2: Are depression and other variables correlated?/Is there any correlation between depression and the other three variables?
Research Question: How are lack of sleep and lack of physical activity impacting human mental health and depression?/To what extent are lack of sleep and lack of physical activity correlated with poor mental health and depression among tracts and states in the US?
After some analysis and research, we decided to work with a subset of the original dataset, which we extracted and cleaned. We kept only the rows for the health measures of interest and reshaped the dataset from long to wide, so that each tract occupies a single row. The resulting subset has 10 variables: six identifiers (Year, State Abbreviation, County Name, County FIPS, Location Name, and Total Population per tract) and four health measures, each given as the estimated percentage of the tract population with the condition: Depression, Mental Health (MHLTH), Lack of Sleep (Sleep), and Lack of Leisure-Time Physical Activity (LPA).
We consider four key variables in this dataset. The two independent variables are lack of physical activity and lack of sleep; the two dependent variables are depression and poor mental health. We made a diagram that shows how the independent and dependent variables may be correlated with one another. The diagram also suggests a strong correlation between the two dependent variables (Depression and Mental Health) and a similar correlation between the two independent variables (lack of sleep and lack of physical activity). We check this by performing EDA and creating plots to analyze the correlations and relationships among all four variables.
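The filter-then-reshape step described above can be sketched in base R as follows. This is a minimal illustration, not the project's actual cleaning code: the toy values and the long-format column names `MeasureId` and `Data_Value` are assumptions made for the example.

```r
# Toy long-format data: one row per tract and measure (values are made up;
# the column names MeasureId and Data_Value are assumptions for illustration)
long_df <- data.frame(
  LocationName = rep(c(1001, 1002), each = 3),
  MeasureId    = rep(c("DEPRESSION", "SLEEP", "PHLTH"), times = 2),
  Data_Value   = c(27.6, 36.2, 13.1, 23.1, 46.4, 15.7)
)

# Keep only the rows for the measures of interest
keep <- c("DEPRESSION", "SLEEP")
long_sub <- long_df[long_df$MeasureId %in% keep, ]

# Long to wide: one row per tract, one Data_Value.<measure> column per measure
wide_df <- reshape(long_sub,
                   idvar     = "LocationName",
                   timevar   = "MeasureId",
                   direction = "wide")
print(wide_df)
```

With `direction = "wide"`, `reshape()` names the new columns by joining the value column and the measure with a dot, which gives names of the same form as `Data_Value.DEPRESSION` in our dataset.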
This Exploratory Data Analysis (EDA) section aims to examine and understand the relationships between various health-related variables. We begin by performing basic EDA, which involves examining the data types and the number of observations. We then check for missing values, outliers, and data distributions, addressing any issues that may impact the analysis. Next, we conduct descriptive statistics to understand the central tendency, variability, and spread. A series of visualizations and summary statistics for each variable at the national and state levels follow this. We also investigate correlations between variables to identify potential predictors for modeling. Overall, this comprehensive EDA process helps us better understand the dataset, uncover patterns, and detect potential anomalies, laying the groundwork for building accurate and reliable predictive models.
In the Basic EDA section, we scrutinize the dataset’s structure and composition by focusing on data types, variable names, and the number of observations. Subsequently, we address missing values and identify outliers to ensure our data is reliable and robust for further analysis. This crucial step establishes a solid foundation for a deeper exploration of the relationships between health-related variables in the subsequent phases of the EDA process.
# Look at the datatypes
str(geo_health_df)
## spc_tbl_ [72,337 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Year : num [1:72337] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ StateAbbr : chr [1:72337] "AL" "AL" "AL" "AL" ...
## $ StateDesc : chr [1:72337] "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ CountyName : chr [1:72337] "Baldwin" "Barbour" "Chambers" "Chilton" ...
## $ CountyFIPS : num [1:72337] 1003 1005 1017 1021 1031 ...
## $ LocationName : num [1:72337] 1.00e+09 1.01e+09 1.02e+09 1.02e+09 1.03e+09 ...
## $ TotalPopulation : num [1:72337] 4302 4264 3619 3808 2117 ...
## $ Data_Value.DEPRESSION: num [1:72337] 27.6 23.1 25.9 28.2 26.7 26.4 28 27.8 28.3 21.9 ...
## $ Data_Value.LPA : num [1:72337] 29.5 37.9 35.6 32.3 33.9 24.2 28.9 30.7 32.3 16.4 ...
## $ Data_Value.SLEEP : num [1:72337] 36.2 46.4 41.4 39.9 42 36.1 36.9 40.4 41.1 30.4 ...
## $ Data_Value.PHLTH : num [1:72337] 13.1 15.7 15.9 14 13.7 10.4 12.4 13.9 14.8 6.7 ...
## $ Data_Value.MHLTH : num [1:72337] 17.6 18.3 17.3 17.5 17.3 15.2 17.1 17.6 18.2 13.2 ...
## - attr(*, "spec")=
## .. cols(
## .. Year = col_double(),
## .. StateAbbr = col_character(),
## .. StateDesc = col_character(),
## .. CountyName = col_character(),
## .. CountyFIPS = col_double(),
## .. LocationName = col_double(),
## .. TotalPopulation = col_double(),
## .. Data_Value.DEPRESSION = col_double(),
## .. Data_Value.LPA = col_double(),
## .. Data_Value.SLEEP = col_double(),
## .. Data_Value.PHLTH = col_double(),
## .. Data_Value.MHLTH = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Number of observations
nrow(geo_health_df)
## [1] 72337
Our dataset consists of 72,337 observations and 12 variables, including both numeric and character data types. These variables provide information on the year, state abbreviation, state description, county name, county FIPS code, location name, and total population. Moreover, the dataset includes values for depression, leisure-time physical inactivity (LPA), sleep, poor general health (PHLTH), and poor mental health (MHLTH). This data represents various health-related indicators across different locations within the United States, enabling further analysis to identify trends and relationships among these variables.
# Check for missing values
sum(is.na(geo_health_df))
## [1] 0
Upon examining the dataset, we found no missing values across all variables. This completeness is a significant advantage, as it ensures the reliability and robustness of our analysis without requiring imputation or other techniques to address missing data. Consequently, we can confidently proceed with our exploration of the relationships between health-related variables.
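A single total can hide where missing values occur, so a per-column count is a useful companion check. The sketch below uses a small made-up data frame (not the project data) to show the idea with base R only.

```r
# Toy data frame with a few deliberately missing entries (values are made up)
toy <- data.frame(
  StateAbbr        = c("AL", "AK", NA),
  Data_Value.SLEEP = c(36.2, NA, 41.4),
  Data_Value.LPA   = c(29.5, 37.9, 35.6)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Logical vector marking the rows that have no missing values at all
rows_complete <- complete.cases(toy)
print(sum(rows_complete))
```

Applied to `geo_health_df`, `colSums(is.na(geo_health_df))` would be expected to return zero for every column, given the total of zero reported above.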
We assess the presence of outliers in the health-related variables by analyzing their distributions. Both the original data and the data with outliers removed show approximately normal distributions for Depression, MHLTH, and Sleep, while LPA is slightly right-skewed. Comparing the two versions of the dataset reveals minimal differences. Therefore, we analyze the dataset with outliers included, expecting a minimal impact on our findings while enabling a comprehensive understanding of the relationships among the health-related variables.
# Check for outliers (outlierKD2() is a helper that must be loaded before this chunk runs)
outlier_Depression <- outlierKD2(geo_health_df, geo_health_df$Data_Value.DEPRESSION)
outlier_MHLTH <- outlierKD2(geo_health_df, geo_health_df$Data_Value.MHLTH)
outlier_SLEEP <- outlierKD2(geo_health_df, geo_health_df$Data_Value.SLEEP)
outlier_LPA <- outlierKD2(geo_health_df, geo_health_df$Data_Value.LPA)
# Create a list containing the data frames by each state
data_by_state <- split(geo_health_df, geo_health_df$StateAbbr)
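For readers without access to the outlier helper used above, one common way to flag outliers is the 1.5 × IQR (boxplot) rule. The sketch below is an assumption-laden illustration on made-up numbers; the source of `outlierKD2()` is not shown in this report, so this should not be read as its exact rule.

```r
# Toy vector with one obviously extreme value (values are made up)
x <- c(10, 12, 11, 13, 12, 11, 40)

# Quartiles and interquartile range
q     <- unname(quantile(x, c(0.25, 0.75)))
iqr   <- q[2] - q[1]

# Fences of the 1.5 * IQR rule
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag values outside the fences and build an outlier-free copy
is_outlier <- x < lower | x > upper
x_clean    <- x[!is_outlier]

print(x[is_outlier])
print(summary(x_clean))
```

Comparing `summary(x)` with `summary(x_clean)` mirrors the with-and-without-outliers comparison described in the text.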
In the descriptive statistics section, our objective is to understand the central tendency, variability, and spread of various health-related indicators. We analyze the dataset at national and state levels, focusing on critical aspects of health and well-being. By calculating summary statistics, standard deviations, and mean values for each state, we create visual representations using maps and boxplots. This comprehensive analysis allows us to identify trends, patterns, and regional disparities, paving the way for a deeper exploration of the relationships and underlying factors influencing these health-related variables.
summary_nation_DEPRESSION <- summary(geo_health_df$Data_Value.DEPRESSION)
summary_nation_MHLTH <- summary(geo_health_df$Data_Value.MHLTH)
summary_nation_SLEEP <- summary(geo_health_df$Data_Value.SLEEP)
summary_nation_LPA <- summary(geo_health_df$Data_Value.LPA)
summary_nation_DEPRESSION_df <- tidy(summary_nation_DEPRESSION)
summary_nation_MHLTH_df <- tidy(summary_nation_MHLTH)
summary_nation_SLEEP_df <- tidy(summary_nation_SLEEP)
summary_nation_LPA_df <- tidy(summary_nation_LPA)
summary_df <- rbind(summary_nation_DEPRESSION_df, summary_nation_MHLTH_df, summary_nation_SLEEP_df, summary_nation_LPA_df)
summary_df$Variable <- c("DEPRESSION", "MHLTH", "SLEEP", "LPA")
summary_df <- summary_df[,c("Variable", "minimum", "q1", "median", "mean", "q3", "maximum")]
kable(summary_df, align = "c",
col.names = c("Var", "Min", "1Q", "Median", "Mean", "3Q", "Max"),
caption = "Summary Statistics of Health Variables")
| Var | Min | 1Q | Median | Mean | 3Q | Max |
|---|---|---|---|---|---|---|
| DEPRESSION | 8.5 | 17.9 | 20.4 | 20.5 | 22.9 | 37.8 |
| MHLTH | 6.1 | 13.2 | 15.0 | 15.1 | 16.9 | 33.0 |
| SLEEP | 19.8 | 30.7 | 33.5 | 34.0 | 36.6 | 54.4 |
| LPA | 7.8 | 18.7 | 23.6 | 24.5 | 29.3 | 63.7 |
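The same kind of table can be assembled more compactly by applying `summary()` to several columns at once. This is a sketch on a made-up two-column data frame, offered as an alternative to the step-by-step construction above rather than as the code that produced the table.

```r
# Toy data with two of the health variables (values are made up)
toy <- data.frame(
  Data_Value.DEPRESSION = c(20, 22, 24, 26),
  Data_Value.SLEEP      = c(30, 34, 38, 42)
)

# summary() on each column gives a 6 x k matrix; transpose to one row per variable
summary_mat <- t(sapply(toy, summary))

# Tidy up into a data frame with a short variable label
summary_tbl <- data.frame(
  Variable = sub("Data_Value.", "", rownames(summary_mat), fixed = TRUE),
  summary_mat,
  row.names   = NULL,
  check.names = FALSE
)
print(summary_tbl)
```

The resulting data frame can be passed to `kable()` in the same way as `summary_df`.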
At the national level, the summary statistics reveal the following patterns for the health-related variables:
Depression: The percentage of the population affected by depression ranges from 8.5% to 37.8%, with a median of 20.4% and a mean of 20.5%.
Mental Health (MHLTH): The percentage of people experiencing poor mental health for 14 or more days ranges from 6.1% to 33.0%, with a median of 15.0% and a mean of 15.1%.
Sleep: The percentage of the population with inadequate sleep ranges from 19.8% to 54.4%, with a median of 33.5% and a mean of 34.0%.
Leisure-time Physical Inactivity (LPA): The percentage of the population without adequate leisure-time physical activities ranges from 7.8% to 63.7%, with a median of 23.6% and a mean of 24.5%.
These results highlight the necessity of examining the complex relationships between these variables to better understand their correlation and develop effective strategies for improving public health.
The map-based analysis presents an overview of the mean values for four health conditions across U.S. states. These color-coded maps utilize darker shades to indicate higher mean values for each health condition.
The first map displays mean depression values, revealing a higher prevalence in the eastern region, particularly around West Virginia and the western part of Washington. The second map shows the average rates of poor mental health, which are more common in the eastern states than in the West. The third map illustrates average sleep deprivation levels, demonstrating a higher prevalence in the eastern part of the country compared to the western region. The fourth map highlights the percentages of individuals engaging in less physical activity, with the Southeast displaying exceptionally high rates.
These findings suggest that, on average, the East experiences worse health outcomes than the West concerning the variables of interest. This information is crucial for understanding regional health disparities and informing targeted public health interventions.
# Summary statistics of tract-level depression prevalence (%) in each state
summary_by_state_DEPRESSION <- lapply(data_by_state, function(x) summary(x$Data_Value.DEPRESSION))
# head(summary_by_state_DEPRESSION,3)
# tail(summary_by_state_DEPRESSION,3)
# Standard deviation of tract-level depression prevalence (%) in each state
sd_by_state_DEPRESSION <- lapply(data_by_state, function(x) sd(x$Data_Value.DEPRESSION))
# head(sd_by_state_DEPRESSION,3)
# tail(sd_by_state_DEPRESSION,3)
# Data frame of the mean tract-level depression prevalence (%) in each state
mean_by_state_DEPRESSION <- lapply(data_by_state, function(x) mean(x$Data_Value.DEPRESSION))
mean_df_DEPRESSION <- data.frame(State = names(mean_by_state_DEPRESSION), Mean_Depression = unlist(mean_by_state_DEPRESSION))
mean_df_DEPRESSION["fips"] <- fips(mean_df_DEPRESSION$State)
# Map of the U.S. showing the mean depression prevalence (%) by state
plot_usmap(data = mean_df_DEPRESSION, values = "Mean_Depression", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "red") +
  scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Mean Depression by State") +
  guides(fill = guide_colorbar(title = "Mean %",
                               title.position = "top",
                               title.hjust = 0.5,
                               label.position = "left",
                               label.hjust = 0.5))
# Summary statistics per state: tract-level percentage of the population with 14 or more days of poor mental health
summary_by_state_MHLTH <- lapply(data_by_state, function(x) summary(x$Data_Value.MHLTH))
# head(summary_by_state_MHLTH,3)
# tail(summary_by_state_MHLTH,3)
# Standard deviation per state: tract-level percentage of the population with 14 or more days of poor mental health
sd_by_state_MHLTH <- lapply(data_by_state, function(x) sd(x$Data_Value.MHLTH))
# head(sd_by_state_MHLTH,3)
# tail(sd_by_state_MHLTH,3)
# Data frame of the mean value per state: tract-level percentage of the population with 14 or more days of poor mental health
mean_by_state_MHLTH <- lapply(data_by_state, function(x) mean(x$Data_Value.MHLTH))
mean_df_MHLTH <- data.frame(State = names(mean_by_state_MHLTH), Mean_MHLTH = unlist(mean_by_state_MHLTH))
mean_df_MHLTH["fips"] <- fips(mean_df_MHLTH$State)
# Map of the U.S. showing the mean percentage of the population with poor mental health by state
plot_usmap(data = mean_df_MHLTH, values = "Mean_MHLTH", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "green") +
  scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Mean Poor Mental Health by State") +
  guides(fill = guide_colorbar(title = "Mean %",
                               title.position = "top",
                               title.hjust = 0.5,
                               label.position = "left",
                               label.hjust = 0.5))
# Summary statistics per state: tract-level percentage of the population lacking sleep
summary_by_state_SLEEP <- lapply(data_by_state, function(x) summary(x$Data_Value.SLEEP))
# head(summary_by_state_SLEEP,3)
# tail(summary_by_state_SLEEP,3)
# Standard deviation per state: tract-level percentage of the population lacking sleep
sd_by_state_SLEEP <- lapply(data_by_state, function(x) sd(x$Data_Value.SLEEP))
# head(sd_by_state_SLEEP,3)
# tail(sd_by_state_SLEEP,3)
# Data frame of the mean value per state: tract-level percentage of the population lacking sleep
mean_by_state_SLEEP <- lapply(data_by_state, function(x) mean(x$Data_Value.SLEEP))
mean_df_SLEEP <- data.frame(State = names(mean_by_state_SLEEP), Mean_Sleep = unlist(mean_by_state_SLEEP))
mean_df_SLEEP["fips"] <- fips(mean_df_SLEEP$State)
# Map of the U.S. showing the mean percentage of the population lacking sleep by state
plot_usmap(data = mean_df_SLEEP, values = "Mean_Sleep", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "purple") +
  scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Mean Lack of Sleep by State") +
  guides(fill = guide_colorbar(title = "Mean %",
                               title.position = "top",
                               title.hjust = 0.5,
                               label.position = "left",
                               label.hjust = 0.5))
# Summary statistics per state: tract-level percentage of the population without leisure-time physical activity
summary_by_state_LPA <- lapply(data_by_state, function(x) summary(x$Data_Value.LPA))
# head(summary_by_state_LPA,3)
# tail(summary_by_state_LPA,3)
# Standard deviation per state: tract-level percentage of the population without leisure-time physical activity
sd_by_state_LPA <- lapply(data_by_state, function(x) sd(x$Data_Value.LPA))
# head(sd_by_state_LPA,3)
# tail(sd_by_state_LPA,3)
# Data frame of the mean value per state: tract-level percentage of the population without leisure-time physical activity
mean_by_state_LPA <- lapply(data_by_state, function(x) mean(x$Data_Value.LPA))
mean_df_LPA <- data.frame(State = names(mean_by_state_LPA), Mean_LPA = unlist(mean_by_state_LPA))
mean_df_LPA["fips"] <- fips(mean_df_LPA$State)
# Map of the U.S. showing the mean percentage of the population without leisure-time physical activity by state
plot_usmap(data = mean_df_LPA, values = "Mean_LPA", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "cyan") +
  scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Mean Lack of Physical Activity by State") +
  guides(fill = guide_colorbar(title = "Mean %",
                               title.position = "top",
                               title.hjust = 0.5,
                               label.position = "left",
                               label.hjust = 0.5))
# create mean_df for all four conditions
mean_df <- merge(merge(mean_df_DEPRESSION, mean_df_MHLTH, by=c("State", "fips")), mean_df_SLEEP, by=c("State", "fips"))
mean_df <- merge(mean_df, mean_df_LPA, by=c("State", "fips"))
colnames(mean_df)[3:6] <- c("Depression", "MHLTH", "Sleep", "LPA")
head(mean_df)
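As a cross-check on the merged table, the per-state means for all four variables can also be computed in one call with base R's `aggregate()`. The sketch below runs on a made-up four-row data frame; the object name `mean_df_alt` is hypothetical and only used for illustration.

```r
# Toy tract-level data for two states (values are made up)
toy <- data.frame(
  StateAbbr             = c("AL", "AL", "AK", "AK"),
  Data_Value.DEPRESSION = c(20, 30, 10, 20),
  Data_Value.MHLTH      = c(15, 17, 12, 14),
  Data_Value.SLEEP      = c(36, 40, 30, 32),
  Data_Value.LPA        = c(28, 32, 20, 22)
)

health_vars <- c("Data_Value.DEPRESSION", "Data_Value.MHLTH",
                 "Data_Value.SLEEP", "Data_Value.LPA")

# One row per state, one mean per health variable
mean_df_alt <- aggregate(toy[health_vars],
                         by  = list(State = toy$StateAbbr),
                         FUN = mean)
names(mean_df_alt)[-1] <- c("Depression", "MHLTH", "Sleep", "LPA")
print(mean_df_alt)
```

`aggregate()` returns the groups in sorted order, so the rows line up with an alphabetical list of state abbreviations.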
The boxplot analysis provides a comprehensive visualization of the distribution and spread of four health-related variables across U.S. states. Each point in the boxplots represents one state’s mean value, so the plots show variation between states only, not variation within them.
Key features of the boxplots include the largest and smallest values for each health variable: Depression is highest in West Virginia (WV) and lowest in Hawaii (HI); leisure-time physical inactivity (LPA) peaks in Kentucky (KY) and reaches a minimum in Utah (UT); poor mental health (MHLTH) is most prevalent in WV and least common in South Dakota (SD); and sleep deprivation is highest in HI and lowest in Minnesota (MN).
Among the health variables, sleep deprivation has the largest mean value, followed by LPA, Depression, and MHLTH. This boxplot analysis highlights the disparities in health outcomes across states, which is essential for understanding regional differences and informing targeted public health interventions.
# create a long format of the data
mean_df_long <- gather(mean_df, key = "Variable", value = "Value", -State, -fips)
# group data by Variable and get the max and min values for each group
max_min_df <- mean_df_long %>% group_by(Variable) %>%
slice(which.max(Value), which.min(Value)) %>% ungroup()
## create a box plot for each variable and facet by variable
ggplot(mean_df_long, aes(x = "", y = Value, fill = Variable)) +
geom_boxplot() +
geom_jitter(aes(color = Variable), width = 0.2, size = 2) +
facet_wrap(~Variable, ncol = 4, scales = "fixed") +
scale_fill_manual(values = c("pink", "cyan", "lightgreen", rgb(200, 162, 200, maxColorValue = 255))) +
scale_color_manual(values = c("darkred", "darkblue", "darkgreen", "darkorchid")) +
labs(title = "Health Condition Distributions", x="Health Conditions", y = "% of State Pop. with Condition") +
geom_text(data = max_min_df, aes(x = 1.25, y = Value, label = State), size = 4, fontface = "bold", hjust = -0.2, color = "black") # add state labels for max and min values
We begin to investigate the relationships between the four health conditions by computing a correlation matrix from the state-level means, which gives a national-level view of the correlations between our variables. We use the corrplot() function to draw the matrix as a heat map, and the corrplot.mixed() function to draw a mixed correlation heat map as a second way of displaying the same correlations. The strongest correlations are between Depression and MHLTH (0.75), Sleep and LPA (0.64), MHLTH and Sleep (0.60), and MHLTH and LPA (0.59). In other words, states with high levels of depression are also likely to have high levels of poor mental health, and states with high levels of inadequate sleep are also likely to have high levels of inadequate physical activity. States with high levels of poor mental health are likely to have high levels of both inadequate sleep and inadequate physical activity. The weakest correlation is between Depression and Sleep (0.22). Overall, these procedures help us examine the connections between our variables and identify potential predictors for modeling.
# Compute the correlation matrix
# create national correlation matrix
cor_matrix <- cor(mean_df[ , c("Depression", "MHLTH", "Sleep", "LPA")])
# create national correlation heat map
corrplot(cor_matrix, method = "color")
# create mixed national correlation heat map
mixed_cor_heat_map <- corrplot.mixed(cor_matrix,
main = "Correlation Between Health Conditions (National)",
mar = c(0,0,2,0))
mixed_cor_heat_map
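To read the strongest and weakest pairs off a correlation matrix without scanning the plot, the upper triangle can be flattened into a ranked table. The sketch below uses a made-up three-variable data frame, not `mean_df`; the names `cm` and `pairs_df` are hypothetical.

```r
# Toy state-level means (values are made up)
toy <- data.frame(
  Depression = c(20, 22, 25, 19, 24),
  MHLTH      = c(14, 15, 17, 13, 16),
  Sleep      = c(30, 35, 33, 31, 36)
)

# Pearson correlation matrix
cm <- cor(toy)

# Row/column positions of the upper triangle (each pair appears once)
idx <- which(upper.tri(cm), arr.ind = TRUE)

# One row per variable pair, sorted from strongest to weakest correlation
pairs_df <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(pairs_df)
```

Running the same steps on `cor_matrix` would list the pairs in the order discussed in the text.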
# Create a data frame containing state, FIPS code, MHLTH_SLEEP, and MHLTH_LPA
health_vars <- c("Data_Value.DEPRESSION", "Data_Value.MHLTH", "Data_Value.SLEEP", "Data_Value.LPA")
cor_by_state_matrix <- lapply(data_by_state, function(state) cor(state[health_vars]))
# Index the correlation matrices by name rather than by position
cor_by_state_df <- data.frame(
  state = names(cor_by_state_matrix),
  MHLTH_SLEEP = sapply(cor_by_state_matrix, function(m) m["Data_Value.MHLTH", "Data_Value.SLEEP"]),
  MHLTH_LPA = sapply(cor_by_state_matrix, function(m) m["Data_Value.MHLTH", "Data_Value.LPA"])
)
cor_by_state_df["fips"] <- fips(cor_by_state_df$state)
We create two scatter plots to explore the relationships between two health risk behaviors (lack of sleep and lack of physical activity) and two health outcomes (Depression and MHLTH). In the first scatter plot, we plot the percentage of people who report lacking sleep on the x-axis and the percentages of people who report Depression and MHLTH on the y-axis. We color the points by health outcome (red for Depression and green for MHLTH) to differentiate the two, and we add a line of best fit for each outcome using geom_smooth() to show the general trend in the data. The second scatter plot is built the same way, with the percentage of people who report lacking physical activity on the x-axis. Each point is a state-level mean, so together the plots give a national-level picture of how the health outcomes vary with each risk behavior.
Each line of best fit is a straight line calculated to minimize the sum of squared differences between the predicted and actual values of the y-variable (health outcome) for the observed values of the x-variable (lack of sleep or lack of physical activity). The sign of the slope indicates the direction of the relationship. If the slope is positive, the y-variable tends to increase as the x-variable increases; if it is negative, the y-variable tends to decrease. The magnitude of the slope gives the expected change in the y-variable for a one-unit change in the x-variable; the strength of the relationship is measured by the correlation coefficient, not by the steepness of the line. The intercept represents the predicted value of the y-variable when the x-variable equals zero. In these plots, however, a value of zero lies far outside the observed data, so the intercept is not particularly informative. It is worth noting that the line of best fit is just one way to model the relationship between two variables, and it assumes that the relationship is linear. A different form, such as a curve or a logarithmic relationship, might fit the data better. Nonetheless, the line of best fit is a useful tool for visually representing a potential relationship between two variables.
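The slope and intercept of such a line can be inspected directly with `lm()`, which fits the same kind of least-squares line that `geom_smooth(method = "lm")` draws. The numbers below are made up for illustration and are not taken from our data.

```r
# Toy data: x = % lacking sleep, y = % with poor mental health (values are made up)
x <- c(30, 32, 34, 36, 38)
y <- c(14, 15, 15.5, 17, 18)

# Least-squares line of best fit
fit       <- lm(y ~ x)
intercept <- unname(coef(fit)[1])
slope     <- unname(coef(fit)[2])

# The slope equals cov(x, y) / var(x); the correlation measures strength separately
slope_by_hand <- cov(x, y) / var(x)
r             <- cor(x, y)

cat("slope:", slope, " intercept:", intercept, " correlation:", r, "\n")
```

Rescaling `y` (for example, multiplying it by 10) changes the slope by the same factor but leaves `r` unchanged, which is why steepness alone does not measure the strength of a relationship.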
# Create scatter plots
# Sleep scatter plot
ggplot(mean_df, aes(x = Sleep)) +
  geom_point(aes(y = Depression, color = "Depression")) +
  geom_point(aes(y = MHLTH, color = "MHLTH")) +
  scale_color_manual(name = "Health Outcomes", values = c("Depression" = "red", "MHLTH" = "green")) +
  labs(x = "% Lacking Sleep",
       y = "% With Health Outcomes",
       title = "Correlation Between Lack of Sleep and Health Outcomes") +
  # color is set outside aes() so the fitted lines are drawn in black
  geom_smooth(aes(y = Depression), color = "black", method = "lm", se = TRUE) +
  geom_smooth(aes(y = MHLTH), color = "black", method = "lm", se = TRUE)
# LPA scatter plot
ggplot(mean_df, aes(x = LPA)) +
  geom_point(aes(y = Depression, color = "Depression")) +
  geom_point(aes(y = MHLTH, color = "MHLTH")) +
  scale_color_manual(name = "Health Outcomes", values = c("Depression" = "red", "MHLTH" = "green")) +
  labs(x = "% Lacking Physical Activity",
       y = "% With Health Outcomes",
       title = "Correlation Between Lack of Physical Activity and Health Outcomes") +
  # color is set outside aes() so the fitted lines are drawn in black
  geom_smooth(aes(y = Depression), color = "black", method = "lm", se = TRUE) +
  geom_smooth(aes(y = MHLTH), color = "black", method = "lm", se = TRUE)
These two visualizations show the within-state correlation between MHLTH (poor mental health) and Sleep (lack of sleep) and between MHLTH and LPA (lack of physical activity) across the United States. Each state is colored according to the correlation coefficient between these variables, with darker blue indicating a stronger positive correlation and white indicating no correlation or a negative correlation. The color scale is shown on the right side of each map, states with missing data are colored grey, and the title of each map indicates which variables are being compared.
From the maps, we can conclude that poor mental health is positively correlated with both lack of sleep and lack of physical activity across most US states. This suggests that promoting healthy sleep habits and physical activity could potentially improve mental health outcomes. However, correlation does not necessarily imply causation, and further research would be needed to establish a causal relationship between these variables.
# Map of the U.S. showing the correlation between MHLTH and Sleep
plot_usmap(data = cor_by_state_df, values = "MHLTH_SLEEP", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "blue") +
  scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Correlation Between Mental Health and Sleep") +
  guides(fill = guide_colorbar(title = "Correlation",
                               title.position = "top",
                               title.hjust = 0.5,
                               label.position = "left",
                               label.hjust = 0.5))
# Map of the U.S. showing the correlation between MHLTH and LPA
plot_usmap(data = cor_by_state_df, values = "MHLTH_LPA", labels = TRUE) +
  scale_fill_continuous(low = "white", high = "blue") +
  scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) +
  ggtitle("Correlation Between Mental Health and LPA") +
  guides(fill = guide_colorbar(title = "Correlation",
                               title.position = "top",
                               title.hjust = 0.5,
                               label.position = "left",
                               label.hjust = 0.5))
After investigating national and state-level correlations, we attempt to better understand the degree to which the lifestyle factors (Sleep and LPA) are associated with the health outcomes (Depression and MHLTH) by isolating the effect of each lifestyle factor on each of the health outcomes. We do this by conducting multiple linear regressions (one for each of the health outcomes) while controlling for the effects of each of the independent variables (the lifestyle factors). Doing so enables us to use hypothesis testing to determine whether there are significant linear relationships between the lifestyle factors and the health outcomes. This strategy is superior to our previous correlation testing in two ways: it provides a more refined analysis of the associations between individual variables, and it takes into account the sample size and the number of independent variables to determine whether those associations are statistically significant.
Our hypothesis tests are structured as follows:
MHLTH:
Null Hypothesis: There is no linear relationship between the independent variables (Sleep and LPA) and the dependent variable (MHLTH); that is, both regression coefficients are zero.
Alternative Hypothesis: There is a linear relationship between the independent variables (Sleep and LPA) and the dependent variable (MHLTH); that is, at least one regression coefficient is nonzero.
Depression:
Null Hypothesis: There is no linear relationship between the independent variables (Sleep and LPA) and the dependent variable (Depression); that is, both regression coefficients are zero.
Alternative Hypothesis: There is a linear relationship between the independent variables (Sleep and LPA) and the dependent variable (Depression); that is, at least one regression coefficient is nonzero.
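One way to state these hypotheses formally is sketched below. The notation is an assumption introduced here for illustration: $Y_i$ stands for either health outcome (MHLTH or Depression) for observation $i$ in `mean_df`, and $\beta_S$, $\beta_L$, and $\varepsilon_i$ are not symbols used elsewhere in this report. The model is written without an intercept to mirror the `- 1` term in the `lm()` calls later in this section.

```latex
% Illustrative notation only; not taken from the report.
% No intercept term, mirroring the `- 1` in the lm() formulas.
\begin{aligned}
Y_i &= \beta_S \,\mathrm{Sleep}_i + \beta_L \,\mathrm{LPA}_i + \varepsilon_i \\
H_0 &: \beta_S = \beta_L = 0 \\
H_A &: \beta_S \neq 0 \ \text{or} \ \beta_L \neq 0
\end{aligned}
```

Under this reading, the p-value reported for each coefficient tests whether that single coefficient differs from zero, while the joint statement in $H_0$ corresponds to the model's overall F-test.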
Our multiple linear regressions produce some valuable results:
The MHLTH regression output includes nonzero, statistically significant regression coefficients for both Sleep (0.2000) and LPA (0.1450). Because the variables are prevalence percentages, this means that a 1 percentage point increase in Sleep is associated with a 0.20 percentage point increase in MHLTH, holding LPA constant, and a 1 percentage point increase in LPA is associated with a 0.15 percentage point increase in MHLTH, holding Sleep constant. Both regression coefficients are statistically significant at the 5% level: Sleep with a p-value of 0.0110 and LPA with a p-value of 0.0149. Because both p-values are below 0.05, we conclude that there is a positive linear relationship between each of the independent variables (Sleep and LPA) and the dependent variable (MHLTH). Consequently, we reject the null hypothesis for MHLTH. Variations in the lifestyle factors Sleep and LPA both seem to be associated with variations in MHLTH, and of these lifestyle factors, Sleep (0.2000) seems to have a slightly greater association with MHLTH than LPA (0.1450). This regression has an adjusted R-squared value of 0.4120, indicating that, after accounting for both the number of observations and the number of variables used, approximately 41% of the variation in MHLTH levels is associated with variation in Sleep and LPA levels. For a model with only two independent variables, this is a substantial share and a valuable finding.
By contrast, the Depression regression output includes nonzero regression coefficients for Sleep (0.0339) and LPA (0.2180), but these coefficients are not statistically significant. Sleep has a p-value of 0.8490 and LPA has a p-value of 0.1110. Neither value is below 0.05 (or even below 0.10), so we cannot be confident that there is a linear relationship between the independent variables (Sleep and LPA) and the dependent variable (Depression). Consequently, we fail to reject the null hypothesis for Depression. We find no statistically significant evidence that Sleep or LPA is associated with Depression, although failing to reject the null hypothesis does not show that no association exists.
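To make the MHLTH coefficients concrete, the arithmetic below applies them to a hypothetical comparison. The 5-point gap is an arbitrary illustrative value, not a figure from the data, and the calculation reads the variables as prevalence percentages.

```latex
% Illustrative only: the 5-point gap is a hypothetical value, not from the data.
\begin{aligned}
\Delta \mathrm{MHLTH} &\approx 0.2000 \times 5 = 1.00 \ \text{percentage points (gap in Sleep, LPA held fixed)} \\
\Delta \mathrm{MHLTH} &\approx 0.1450 \times 5 = 0.725 \ \text{percentage points (gap in LPA, Sleep held fixed)}
\end{aligned}
```

In other words, under the fitted model, two states that differ by 5 points in Sleep but have the same LPA would be expected to differ by about 1 point in MHLTH. This is a statement about association, not about what would happen if a state changed its Sleep level.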
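# stargazer formats the regression tables (harmless if already loaded earlier)
library(stargazer)

# multiple linear regressions
# note: the `- 1` term removes the intercept, so each model is fit through the origin
depression.mlg <- lm(Depression ~ Sleep + LPA - 1, data = mean_df)
mhlth.mlg <- lm(MHLTH ~ Sleep + LPA - 1, data = mean_df)

# mlg tables
# report = "vcstp*" prints variable names, coefficients, standard errors,
# t-statistics, p-values, and significance stars
depression.mlg.table <- stargazer(depression.mlg,
                                  type = "text",
                                  title = "Multiple Linear Regression for Depression, Sleep, and LPA",
                                  header = FALSE,
                                  digits = 4,
                                  star.cutoffs = c(0.05, 0.01, 0.001),
                                  report = "vcstp*")
mhlth.mlg.table <- stargazer(mhlth.mlg,
                             type = "text",
                             title = "Multiple Linear Regression for Mental Health, Sleep, and LPA",
                             header = FALSE,
                             digits = 4,
                             star.cutoffs = c(0.05, 0.01, 0.001),
                             report = "vcstp*")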
# # mlg partial regression plots
# mhlth.lpa.mlg.partial.reg.plot <- qqp(mhlth.mlg, "LPA")
#
# mhlth.sleep.mlg.partial.reg.plot <- qqp(mhlth.mlg, "Sleep")
#
# # mlg coefficient plots
# dep.mlg.coefplot <- coefplot(depression.mlg, exclude = TRUE, title = "Depression")
# dep.mlg.coefplot
#
# mhlth.mlg.coefplot <-coefplot(mhlth.mlg, exclude = TRUE, title = "MHlTH")
# mhlth.mlg.coefplot
#
# # mlg residual plots
# dep.residual.plot <- plot(depression.mlg, which = 1)
# dep.residual.plot
#
# mhlth.residual.plot <- plot(mhlth.mlg, which = 1)
# mhlth.residual.plot
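The quantities discussed in the regression results (coefficients, p-values, adjusted R-squared, and the overall F-test) can also be read directly from a fitted model rather than from the printed table. The following is a minimal sketch using base R only. It runs on simulated data, not the PLACES data: `sim_df`, its column values, and the coefficients used to generate it are all assumptions made for illustration, so its output will not match the figures reported above.

```r
# Simulated stand-in for the state-level data; NOT the PLACES data.
# All values below are hypothetical and chosen only so the example runs.
set.seed(42)
n <- 51
sim_df <- data.frame(
  Sleep = rnorm(n, mean = 34, sd = 3),
  LPA   = rnorm(n, mean = 24, sd = 4)
)
sim_df$MHLTH <- 0.2 * sim_df$Sleep + 0.15 * sim_df$LPA + rnorm(n, sd = 1)

# Same formula shape as the report's models (no intercept because of `- 1`)
fit <- lm(MHLTH ~ Sleep + LPA - 1, data = sim_df)
fit_summary <- summary(fit)

# Coefficient table: one row per predictor, with estimate, std. error,
# t-statistic, and p-value columns
coef_table <- coef(fit_summary)
estimates  <- coef_table[, "Estimate"]
p_values   <- coef_table[, "Pr(>|t|)"]

# Adjusted R-squared as computed by summary.lm()
adj_r2 <- fit_summary$adj.r.squared

# Overall F-test of the null hypothesis that all coefficients are zero
f_stat    <- fit_summary$fstatistic
overall_p <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
                lower.tail = FALSE)

print(round(estimates, 4))
print(signif(p_values, 3))
print(round(adj_r2, 4))
print(signif(overall_p, 3))
```

Replacing `sim_df` with `mean_df` and `fit` with `mhlth.mlg` or `depression.mlg` would apply the same extraction to the report's models. One design note: for a model fit without an intercept, `summary.lm()` computes R-squared relative to zero rather than to the mean of the dependent variable, so its value is not directly comparable to R-squared from a model that includes an intercept.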
In conclusion, our exploration and testing of the four health variables yield some important insights:
All four health conditions are approximately normally distributed (with the possible exception of LPA, which may be slightly right-skewed).
At the national level, there is some degree of positive correlation between all four conditions.
At the state level, Depression levels are consistently higher than MHLTH levels and Sleep levels are consistently higher than LPA levels.
Sleep and LPA are more strongly correlated with MHLTH than with Depression. This suggests that lifestyle factors may have more of an effect on mental health than on depression (although, of course, we cannot make a definitive causal statement).
MHLTH correlations vary by geography. Its correlations with LPA are more consistent across states than its correlations with Sleep.
There is a statistically significant linear relationship between the independent variables (Sleep and LPA) and MHLTH (the dependent variable).
There is no statistically significant linear relationship between the independent variables (Sleep and LPA) and Depression (the dependent variable).
After controlling for the other lifestyle factor, Sleep has a slightly greater and more significant association with MHLTH than LPA does.
In sum, it seems that lifestyle factors are more strongly associated with MHLTH than with Depression, and that of those lifestyle factors, Sleep has a greater association than LPA. It is worth noting that, according to the CDC measure definitions, the MHLTH variable measures individuals' short-term self-assessed mental health (adults reporting 14 or more days of poor mental health during the past 30 days), whereas the Depression variable measures whether individuals have ever been told by a health professional that they have a depressive disorder. So variations in lifestyle factors are associated with variations in less severe, short-term psychological problems, but we find no statistically significant association with variations in more severe, long-term psychological problems.
This research has several limitations that should be considered. First, the cross-sectional nature of the data prevents the establishment of causal relationships between lifestyle factors and mental health outcomes, as only associations can be inferred. Second, the research is limited to the variables available in the dataset, potentially omitting other relevant factors, such as nutrition, social support, or access to healthcare, which could influence the observed relationships.
Moreover, the study relies on self-reported data for some variables, which may be subject to recall bias or social desirability bias, potentially affecting the accuracy of the results. Additionally, the analysis does not take into account potential confounding variables or interactions between variables that could influence the associations between lifestyle factors and mental health outcomes.
This study suggests that lifestyle factors such as sleep and physical activity may be associated with mental health outcomes. Future research could expand the scope of the analysis to incorporate additional variables such as nutritional habits, social support, access to healthcare, and socioeconomic factors. Longitudinal and time-series data could also be employed to investigate causal relationships between lifestyle factors and mental health outcomes, as well as the impact of the COVID-19 pandemic on these relationships. Examining potential confounding variables and interactions between variables, implementing advanced statistical techniques, and conducting qualitative research could provide deeper insights into the complex relationships between lifestyle factors and mental health outcomes.
In conclusion, the analysis suggests that there are significant relationships among the prevalence of depression, poor mental health, lack of sleep, and lack of physical activity, and that these conditions vary across US states, with certain regions of the country experiencing higher rates of these health issues than others. The findings of this analysis can help healthcare professionals and policymakers identify areas with higher health risks and target interventions accordingly.
Centers for Disease Control and Prevention. (2021, October 18). Measure definitions. Centers for Disease Control and Prevention. Retrieved March 8, 2023, from https://www.cdc.gov/places/measure-definitions/index.html
Centers for Disease Control and Prevention. (n.d.). PLACES: Local data for better health, Census Tract Data 2022 release. Centers for Disease Control and Prevention. Retrieved March 8, 2023, from https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh